
(CVPR 2018) Data Distillation: Towards Omni-Supervised Learning

Radosavovic I, Dollár P, Girshick R, et al. Data distillation: Towards omni-supervised learning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4119-4128.

1. Overview

1.1. Motivation

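  • Semi-supervised learning is usually simulated by splitting a fully annotated dataset into labeled and unlabeled parts, so its accuracy is upper-bounded by training on all annotations
  • Omni-supervised learning uses all the annotated data plus extra unlabeled data, so its accuracy is lower-bounded by training on all annotations
  • Model distillation distills knowledge from the predictions of multiple models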

In this paper

  • The authors propose data distillation, which ensembles the predictions of a single model on multiple transformations of unlabeled data
  • Experiments cover keypoint detection and object detection

1.2. Knowledge Distillation

  1. Train a teacher model on a large amount of labeled data (A)
  2. Generate annotations on unlabeled data (B) with the teacher model
  3. Retrain a student model on the combined data (A+B)
  • Problem. Training a model on its own predictions does not provide meaningful new information
  • Solution. Ensemble the predictions on different transformations of the unlabeled data (flipping, scaling)
  • Ensembling multiple models, model compression
  • FitNet
  • Cross-modal distillation
  • Multi-view geometry
  • Auto-encoder (multiple capsules)

  • Self-training can be used to train object detectors

  • Multiple views or perturbations of the data can provide a useful signal for semi-supervised learning
    The method in this paper is also based on multiple geometric transformations.

2. Data Distillation


  1. Train a teacher model on labeled data (A)
  2. Apply the teacher model to multiple transformations of the unlabeled data (B)
  3. Ensemble the multiple predictions to obtain annotations
  4. Retrain a student model on the combined data (A+B)
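
The four steps above can be sketched as a small generic function. This is only an outline under stated assumptions: `data_distillation`, `train`, `transforms` and `ensemble` are hypothetical names, not the paper's code, and the actual models, transformations and merging logic are passed in by the caller.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

X = TypeVar("X")  # an input example, e.g. an image
Y = TypeVar("Y")  # an annotation, e.g. keypoints or boxes


def data_distillation(
    labeled: List[Tuple[X, Y]],
    unlabeled: List[X],
    train: Callable[[List[Tuple[X, Y]]], Callable[[X], Y]],
    transforms: Sequence[Tuple[Callable[[X], X], Callable[[Y], Y]]],
    ensemble: Callable[[List[Y]], Y],
) -> Callable[[X], Y]:
    """Hypothetical outline of the four data distillation steps."""
    # Step 1: train the teacher on the labeled set A.
    teacher = train(labeled)

    generated = []
    for x in unlabeled:
        # Step 2: run the teacher on each transformed copy of x and map
        # every prediction back to the original frame with the inverse.
        preds = [inverse(teacher(forward(x))) for forward, inverse in transforms]
        # Step 3: merge the per-transform predictions into one annotation.
        generated.append((x, ensemble(preds)))

    # Step 4: retrain a student on A plus the generated annotations on B.
    return train(labeled + generated)
```

Each entry of `transforms` pairs a forward transformation of the input with the inverse mapping of the prediction, so all predictions are merged in the original coordinate frame.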

2.1. Multi-transform

  • multi-crop
  • multi-scale
    The experiments in this paper use multi-scale inference and horizontal flipping.
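
Predictions made on a transformed image have to be mapped back to the original image frame before they can be combined. A minimal sketch for keypoints under scaling and horizontal flipping, assuming continuous pixel coordinates; the function names and the example scales in `TRANSFORMS` are illustrative and not taken from the paper. A real pose pipeline would also swap left/right joint labels after a flip, which is omitted here.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Hypothetical transform set: every (scale, flip) combination.
TRANSFORMS = [(s, f) for s in (0.5, 1.0, 2.0) for f in (False, True)]


def to_transformed(points: List[Point], width: float, scale: float, flip: bool) -> List[Point]:
    """Map points from the original image into the scaled (and flipped) image."""
    out = []
    for x, y in points:
        x, y = x * scale, y * scale
        if flip:
            # The scaled image has width `width * scale`; mirror around it.
            x = width * scale - x
        out.append((x, y))
    return out


def to_original(points: List[Point], width: float, scale: float, flip: bool) -> List[Point]:
    """Inverse of `to_transformed`: undo the flip first, then the scaling."""
    out = []
    for x, y in points:
        if flip:
            x = width * scale - x
        out.append((x / scale, y / scale))
    return out
```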

2.2. Ensemble

  • Aggregated predictions generate new knowledge
  • The aggregated prediction outperforms any single prediction

2.3. Ensembling Method

2.3.1. Soft Label

  • averages the class probabilities
  • yields a probability vector rather than a categorical label
  • not suitable for structured output spaces (pose, detection)
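
For a classification-style output, soft-label ensembling amounts to averaging the per-transform probability vectors. A minimal sketch, assuming each transform yields a probability vector of the same length; `average_probabilities` is a hypothetical helper name.

```python
from typing import List


def average_probabilities(prob_vectors: List[List[float]]) -> List[float]:
    """Average class probabilities over transforms (soft label)."""
    if not prob_vectors:
        raise ValueError("need at least one probability vector")
    n = len(prob_vectors)
    num_classes = len(prob_vectors[0])
    # The result is still a probability vector, not a single category.
    return [sum(p[c] for p in prob_vectors) / n for c in range(num_classes)]
```

The output stays a distribution over classes, which is why this form does not transfer directly to structured outputs such as keypoints or boxes.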

2.3.2. Hard Label

  • needs task-specific logic (e.g. NMS to merge multiple boxes)
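
As one example of such task-specific logic, boxes pooled from all transforms (already mapped back to the original frame) can be merged with greedy non-maximum suppression. This is a plain-Python sketch, not the paper's implementation; the `(x1, y1, x2, y2)` box format and the 0.5 IoU threshold are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format
Detection = Tuple[Box, float]            # (box, score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections: List[Detection], iou_threshold: float = 0.5) -> List[Detection]:
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    kept: List[Detection] = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```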

3. Detail of Pose Estimation

3.1. Selecting Prediction

Annotations are generated only from predictions whose score is above a threshold. The authors found that

  • it works well to choose the threshold so that the average number of annotated instances per unlabeled image equals that of the labeled images (similar distribution)
  • the method is robust and still works well when the two averages are not equal
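
One way to pick such a threshold is to take the k-th highest score over all unlabeled images, where k is the target average times the number of unlabeled images. A minimal sketch under that assumption; the function names are hypothetical, and ties at the threshold may keep slightly more than k predictions.

```python
from typing import List


def pick_score_threshold(scores_per_image: List[List[float]], target_avg: float) -> float:
    """Threshold that keeps about `target_avg` predictions per image on average."""
    all_scores = sorted((s for scores in scores_per_image for s in scores), reverse=True)
    k = round(target_avg * len(scores_per_image))
    if k <= 0 or not all_scores:
        return float("inf")  # keep nothing
    # Score of the k-th best prediction (or the worst one if fewer exist).
    return all_scores[min(k, len(all_scores)) - 1]


def select_predictions(scores_per_image: List[List[float]], threshold: float) -> List[List[float]]:
    """Keep only the predictions that reach the threshold."""
    return [[s for s in scores if s >= threshold] for scores in scores_per_image]
```

Here `target_avg` would be the average number of annotated instances per labeled image.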

3.2. Retraining

  • Retraining the student is better than fine-tuning the teacher model, which may be stuck in a poor optimum.

4. Experiments

4.1. Data Split

  • co-80. labeled
  • co-35. labeled
  • co-115. co-80 + co-35
  • un-120. unlabeled
  • s1m-180 (static frames from Sports-1M). unlabeled, dissimilar distribution

4.2. Amount of Annotated Data

  • Labeled and unlabeled (automatically annotated) images are mixed at a ratio of 1:ρ in each minibatch.
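
A fixed 1:ρ mix can be enforced when sampling each minibatch. The sketch below is an assumption about how this might be done, not the paper's sampler: `mixed_minibatch` is a hypothetical name, rounding to whole images is a design choice, and images are drawn with replacement.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mixed_minibatch(
    labeled: Sequence[T],
    generated: Sequence[T],
    batch_size: int,
    rho: float,
    rng: random.Random,
) -> List[T]:
    """Sample a minibatch with roughly 1:rho labeled to generated examples."""
    # A 1:rho ratio means a fraction 1 / (1 + rho) of the batch is labeled.
    n_labeled = min(batch_size, max(1, round(batch_size / (1.0 + rho))))
    n_generated = batch_size - n_labeled
    # Draw with replacement so that small sets do not raise an error.
    batch = rng.choices(labeled, k=n_labeled) + rng.choices(generated, k=n_generated)
    rng.shuffle(batch)
    return batch
```

With `batch_size=8` and `rho=3`, each minibatch holds 2 labeled and 6 generated examples.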

4.3. Accuracy of Teacher Model

4.4. Result